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Data compression algorithms are generally perceived as being of interest for data communication and storage 
puiposes only. However, their use in the field of data classification and analysis is also of equal importance. 
Automatic data classification and analysis finds use in varied fields like bioinformatics, language and sequence 
recognition and authorship attribution. Different complexity measures proposed in literature like Shannon en- 
tropy, Relative entropy, Kolmogrov and Algorithmic complexity have drawbacks that make these methods inef- 
fective in analyzing short sequences that are typical in population dynamics and other fields. 

In this paper, we study Non-Sequential Recursive Pair Substitution (NSRPS), a lossless compression 
algorithm first proposed by Ebeling et al. [Math. Biosc. 52, 1980] and Jimenez-Montano et al. 
|arXiv:cond-mat/0204134. 2002]). Using this algorithm, a new complexity measure was recently proposed 
(Nagaraj et al. [arXiv:nlin.CD/l 101.4341vl, 2011]). In this work, we use NSRPS complexity measure for 
analyzing and classifying symbolic sequences generated by ID chaotic dynamical systems. Even with learning 
data-sets of length as small as 25 and test data-sets of length as small as 10, NSRPS measure is able to accu- 
rately classify the test sequence as periodic, chaotic or random. For such short data lengths, methods which use 
entropy measure and traditional lossless compression algorithm like LZ77 [A.Lempel and J.Ziv, IEEE Trans. 
Inform. Theory 22, 75 (1976)] (used for instance by Gzip, Winzip etc.) fails. 



I. INTRODUCTION 

One of the major scientific advances made in the last cen- 
tury was the discovery that information content of a message 
can be objectively measured and quantified. This measure of 
information called the Shannon entropy (H(X)) plays a ma- 
jor role in both theoretical and practical aspects of information 
theory. Entropy estimation is very useful since it is a direct 
measure of the amount of compressibility that can be achieved 
for the given sequence. As per Shannon's Lossless Coding 
theorem [1], entropy H(X) gives the bound on the maximum 
possible lossless compression that can be achieved by a loss- 
less compression algorithm. A truly random sequence would 
have maximum entropy and would thus be uncompressible. 
Entropy serves as an useful indicator of complexity of the data- 
set (higher the complexity, lesser would be the ability to com- 
press). 

Shannon's entropy measure [ 1 ] is given by the following 
expression: 

M 

H (X ) = - ^2 Pi l°g2 (Pi ) bits/symbol , ( 1 ) 

i=l 

where X is the symbolic sequence with M distinct symbols 
and pi is the probability of the i-th symbol for a block-size of 
one. Block-size refers to the number of input symbols taken 
together to compute the probability mass function. 

Entropy may be a good measure but its estimation is not 
trivial. To calculate entropy of a time series, it needs to be 
converted to a symbolic sequence which requires the use of 
partition. The choice of a partition greatly affects the entropy 
value [2]. Also, noise is another factor that increases entropy. 
The length of the sequence also has a significant impact in 
entropy estimation. For most entropy estimation algorithms, 
accurate estimation is possible only for long sequences. For 
short sequences, the above equation is not a very good esti- 
mate of entropy and we resort to the use of lossless data com- 



pression algorithms instead. In this regard, Benedetto et al. [3] 
refer to how compression algorithms like Lempel-Ziv (used 
by Gzip, zip, WinZip) may be used to estimate the complexity 
of a sequence. When LZ-77 compression algorithm encodes 
a sequence of length L with entropy H, into a zipped file of 
length L z , then limL^oo(7f ) — > H. Using this, they define 
relative entropy and find similarities between sequences for 
automatic identification of unknown sequences. Using such 
a measure, Benedetto et al. 13J have been quite successful 
in applications involving long sequences for purposes of lan- 
guage recognition, authorship attribution and language classi- 
fication. While such a measure may work for long sequences, 
it may not be suitable for chaotic sequences which are short 
and noisy. 

Due to these difficulties faced in entropy estimation, vari- 
ous other complexity measures such as Lyapunov exponent, 
Kolmogorov complexity, Algorithmic complexity, Grammar 
complexity and others have been proposed in literature fl- 
One measure of complexity of interest to us is Grammar com- 
plexity, introduced by Ebeling et al. in [5]. The idea behind 
this measure is to compress a sequence and take the length of 
the compressed sequence as a measure of the complexity of 
the sequence. 

The question to probe is, whether Shannon entropy and 
measures based on lossless compression algorithms are effec- 
tive in all situations. Nagaraj et al. (6D have shown with an 
example that Shannon entropy might not be an effective com- 
plexity measure for certain sequences of smaller lengths and 
have proposed a new measure based on the NSRPS algorithm. 
In this paper, we investigate the use of this proposed NSRPS 
complexity measure for automatic classification of sequences 
generated by chaotic maps. 

This paper is organized as follows: Section II deals with 
the NSRPS algorithm and NSRPS measure as proposed in fl. 
Section III deals with the problem of automatic classification 
and identification of periodic, chaotic and random sequences. 
Subsequently, in section IV, test results on various sequences 



2 



are provided and the paper concludes in section V with indi- 
cations of future research directions. 



H. NSRPS ALGORITHM AND NSRPS MEASURE 

Non-sequential Recursive Pair Substitution (NSRPS) was 
first proposed by Ebeling et al. Ot]. It was further improved 
by Jimenez-Montano et al. |0] and subsequently shown to be 
optimal [8]. Grassberger employed NSPRS to estimate En- 
tropy of written English [9]. The algorithm (as described in 
|0]) proceeds as follows. Let the original sequence be called 
X. At the first iteration, that pair of symbols which has max- 
imum number of occurrences is replaced with a new symbol. 
For example, the input sequence '11010010' is transformed 
into '12202' since the pair '10' has maximum number of oc- 
currences compared to other pairs ('00', '01' and '11'). In the 
second iteration, '12202' is transformed to '3202' since '12' 
has maximum frequency (in fact all pairs are equally likely). 
The algorithm proceeds in this fashion until the length of the 
string is 1 (at which stage there is no pair to substitute and the 
algorithm halts). In this example, the algorithm transforms the 
input sequence '11010010' i-> '12202' i-> '3202' i-> '402' 
'52' I-+ '6'. 

In la], it is observed that the number of iterations needed 
to transform the input sequence to a constant sequence is a 
reasonable measure of complexity of the sequence. For a con- 
stant sequence, the quantity 'entropy x length' is zero since 
entropy is zero (irrespective of length of the sequence). We 
record the number of iterations as the complexity measure N. 
In the same paper, it was shown that this measure was suc- 
cessful in distinguishing sequences from chaotic dynamical 
systems such as the Logistic map for different values of the 
bifurcation parameter. It was also demonstrated that the mea- 
sure has a high correlation with the Lyapunov exponent of the 
time series, even for very short data-lengths (as low as 50). 
Furthermore NSRPS measure is very easy to compute. 

In the next section, NSRPS measure will be used to auto- 
matically classify a sequence of unknown complexity given 
several sequences of known complexities. 



in. DATA CLASSIFICATION AND IDENTIFICATION 

The process of data classification and identification begins 
with first quantifying the information contained in a sequence. 
Having objectively quantified the information content, the 
concept of remoteness between pairs of sequences based on 
relative information content is defined and used as a distance 
metric between two sequences. Our goal is to use this dis- 
tance metric and device a method for automatically identify- 
ing unknown sequences of different complexities. The prob- 
lem at hand may be stated as follows: We are given known 
sequences (which we use as learning sequences) of four dif- 
ferent complexities of length L\ and are then provided with 
an unknown test-sequence x of length L2. We suspect that the 
test-sequence is one among the four known sequences and our 
aim is to determine the origin of the test-sequence. 



A similar kind of problem is solved by Benedetto et al. (H) 
in the field of automatic language recognition. Given two se- 
quences A and B, long as well as short sub-sequences are ex- 
tracted from A and B. An unknown short sequence x which 
is to be classified is given. New sequences are formed by ap- 
pending x with long sequences of A and B. Both the original 
and the new sequence are zipped using Gzip and the differ- 
ences in their compressed lengths are noted. Let us denote 
the differences as D\ and D2 where D\ = La+x — La and 
D-2 = Lb+x — Lb- The units of these quantities is bytes. 
x is classified as belonging to that sequence which yields the 
minimum difference. They have been able to successfully rec- 
ognize languages using an initial learning sequence of length 
1-15 kilobytes and a test (unknown) sequence x of length 20 
characters. 

There are a few drawbacks of the above approach, to name 
a few: 

• Usage of entropy requires learning. Hence the initial 
learning sequence has to be of very large length. This 
might not be practically possible for certain time series 
(for e.g. insect population counts). 

• Entropy measure is suitable only for certain data types. 
For data generated from chaotic dynamical systems, en- 
tropy might not be a good complexity measure. 

We shall demonstrate that the above method of Benedetto is 
not suitable for short sequences obtained from chaotic dynam- 
ical systems. Instead, we find that NSRPS measure is more 
successful for classifying sequences from such systems. 

IV. RESULTS 

In this section, we present results pertaining to data from 
chaotic dynamical systems. We use the popular logistic map 
y = ax(l—x) with different values of bifurcation parameter a 
(3.83 and 4) to generate long time series of different complex- 
ities. Data produced with a = 3.83 is periodic, while a = 4 
exhibits chaos as characterized by a positive lyapunov expo- 
nent. We use a partition with four bins of equal size to produce 
symbolic sequences from these time series. We also generate 
uniformly distributed random sequences so that we have three 
different learning sequences with different complexities. We 
also create three short sequences in a similar fashion. Thus we 
have a long and a short sequence for each of the three cases - 
periodic (a = 3.83), chaotic (a = 4.0) and random (uniform). 
Our task is to automatically classify the short sequences given 
the long ones. 

We shall use the following three measures of complexity 
and compare results to determine the best one: 

• First order entropy measure H(X) = 
- Sill P( x i) lo S2b( a; »)) bits/symbol 

• Gzip G — length of the zipped file in bytes ifTTll 

• NSRPS measure N = Number of iterations required by 
NSRPS algorithm to transform the given sequence into 
a constant sequence. 
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We follow a similar procedure as in |3J for finding rela- 
tive distance between sequences. Let the complexities of the 
learning sequences be Ca and Cb- After appending the short 
sequence x of unknown complexity, let the complexities be 
Ca+x and Cb+x- If [Ca+x — Ca] is lesser (greater) than 
[Cb+x — Cb], then we classify x as having the same com- 
plexity as sequence A (B). Analysis was done with learning 
sequences of length 200, 100, 75, 50 and 25 and a range of test 
sequences varying in length from 10% to 100% of the original 
learning sequence length. We have used this procedure for all 
the three complexity measures mentioned above. 

Tables HllIVI summarize the results obtained for the three 
different measures of complexity. Fifty trial runs were 
performed to come up with this result. The values in the table 
gives the success percentage for the measure (for every trial, 
if the measure is able to identify the sequence correctly, then 
it is defined as a success). To ensure complete independence 
of all the trial runs, the initial condition for each of the fifty 
trials was chosen uniformly at random from the interval [0,1). 

Here AL — > Test Sequence Length, 
H — > Entropy measure in bits / symbol, 
G — > Zipped length using Gzip in bytes, 
N -> NSRPS Measure. 

TABLE I: Results for 200 Length learning sequence. All results 
are given in terms of success percentage in classifying unknown se- 
quence across 50 trials. 
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A. Analysis of results 



TABLE II: Results for 100 Length Learning Sequence 
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96 


18 


8 


100 


90 


16 


82 


62 


78 


6 


96 


26 


12 


100 


100 


12 


88 


48 


74 


6 


96 


14 
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TABLE III: Results for 50 Length Learning Sequence 
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TABLE IV: Results for 25 Length Learning Sequence 
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From the given tables, we can infer the following: 

• For a long learning sequence (length = 200), Gzip iden- 
tifies periodic sequences better while chaotic and ran- 
dom sequences are identified much better by NSRPS 
measure. 

• For intermediate length learning sequence (length = 
100), Gzip identifies periodic sequences better in the 
case of long test sequences, while NSRPS performs 
better for smaller test sequences. While considering 



chaotic and random sequences, NSRPS outperforms the 
other two measures for all test sequences. 

• As we reduce the learning sequence lengths to 50 and 
25, it is seen that NSPRS clearly identifies periodic, 
chaotic and random sequences while the other two fail 
to do so. The only exception in this case is when en- 
tropy measure identifies chaotic sequence better than 
NSRPS measure for a learning sequence of length 25. 
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V. CONCLUSIONS AND FUTURE WORK 

There are several measures of complexity among which 
Shannon's entropy is one good measure. This measure is used 
for automatic classification of sequences of unknown com- 
plexity given longer sequences of known complexities. How- 
ever, estimating entropy measure either by use of equation (1) 
or by means of lossless data compression algorithms such as 
LZ-77 has drawbacks. One of the main drawbacks is the poor 
classification performance for short sequences. 

It is observed that Gzip measure doesn't achieve high 
accuracy for identifying chaotic and random sequences, while 
it works with medium efficiency for periodic sequences with 
a learning sequence of lengths greater than 50. This is due 
to the fact that Gzip uses a learning based algorithm and 
due to the repetitive nature of periodic sequences, learning 
can be done much sooner. Entropy measure (using equation 
(1) directly for computation) on the other hand has good 
efficiency while identifying chaotic sequences but performs 
poorly for periodic and random sequences. NSRPS is found 
to be very efficient in dealing with long, intermediate and 
short sequences. 

Future work can be along the following lines: 

• Classify sequences from several chaotic dynamical sys- 
tems such as Tent map, Skew-tent map, binary map, 



Lozi map, Henon map etc. Extend this method for 
sequences from flows (continuous dynamical systems 
such as Lorenz, Rossler etc.). 

• Study the effect of noise added to learning and test se- 
quences and analyze the performance of the complexity 
measures for classifying such noisy sequences. 

• Investigate the use of NSRPS measure for language 
classification and authorship identification applications. 

• Analyze the use of NSRPS as a lossless compression al- 
gorithm for short sequences and compare with standard 
compression algorithms like Huffman, Shanon-Fano, 
Arithmetic coding and LZW. 
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